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This article extends Bayarri and Berger's (1999) 
proposal for model evaluation using "partial pos- 
terior" p values to the evaluation of second-stage 
model assumptions in hierarchical models. Applica- 
tions focus on normal-normal hierarchical models, 
although the final example involves an application 
to a beta-binomial model in which the distribution 
of the test statistic is assumed to be approximately 
normal. 

The notion of using partial posterior p values is 
potentially appealing because it avoids what the au- 
thors refer to as "double use" of the data, that is, use 
of the data for both fitting model parameters and 
evaluating model fit. In classical terms, this phe- 
nomenon is synonymous to masking and is widely 
known to reduce the power of test statistics for di- 
agnosing model inadequacy. In the present context, 
masking is avoided by defining the reference distri- 
bution of a test statistic t by the partial posterior 
distribution, defined as 
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Heuristically, the partial posterior distribution con- 
tains information in the data x b s about model pa- 
rameter not reflected in t b s - Prom this definition, 
it follows that the partial posterior distribution and 
(full) posterior distribution are equivalent when t 
is ancillary, and that the partial posterior distribu- 
tion and prior distribution coincide when t is suffi- 
cient. The latter fact suggests that partial posterior 
distributions defined with respect to improper prior 
densities may not be proper when the test statistic 
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is "approximately sufficient" for some subset of pa- 
rameter values. It also precludes the use of partial 
posterior model assessment for objective Bayesian 
models using test statistics that are sufficient, al- 
though the authors presumably regard sufficient test 
statistics as being useful only for assessing the ad- 
equacy of (proper) prior distributions. Nonetheless, 
insight regarding the relative advantages of the pro- 
posed methodology as test statistics vary from be- 
ing "nearly sufficient" to "nearly ancillary" would 
be useful. 

Under regularity assumptions specified in Robins, 
van der Vaart and Ventura (2000), partial posterior 
p values also have the important property of being 
asymptotically uniformly distributed under the null 
model. Prior-predictive p values and their extensions 
to p values based on pivotal quantities (described be- 
low) share this property — even in finite samples, p 
values based on posterior predictive and related ref- 
erence distributions do not, which makes it difficult 
to interpret these diagnostics for purposes of formal 
model assessment. Bayarri and Costellanos (B&C) 
provide convincing examples that illustrate this dif- 
ficulty and highlight the dangers associated with the 
naive use of nonuniform p values. However, it should 
be noted that the extreme p values reported by the 
authors are perhaps also somewhat suspect given 
the relatively small sample sizes considered in the 
examples. That is, even ignoring errors associated 
with the numerical approximation of the partial pos- 
terior density and the resulting distribution of the 
test statistic, asymptotic uniformity of the partial 
posterior p values may not have been achieved to the 
level of accuracy required for the report of partial 
posterior p values down to the number of significant 
digits provided. This concern is heightened by the 
plots in the third column of Figure 1, which suggest 
that partial posterior p values are anticonservative 
for moderate sample sizes. 

The significant advantage of partial posterior p 
values — that of reducing masking — does not come 
without cost, and two potentially difficult tasks must 
be performed to construct these diagnostics. First, it 
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Fig. 1. Quantile-quantile plots of group mean residuals. The top row depicts the qq-plot obtained from three posterior draws 
from the model utilizing O' Hagan' s prior, while the bottom row depicts qq-plots derived from the three draws from the posterior 
defined using the truncated version of B&C's improper prior specification. 



is necessary to estimate the sampling density of the 
chosen test statistic as a function of the model pa- 
rameter 9. In the article, this task is performed only 
for cases in which the sampling density of the test 
statistic can be easily approximated by exploiting a 
translation-invariance property of the normal distri- 
bution. Such a strategy is unlikely to work outside 
of normal family problems or for more sophisticated 
test statistics (e.g., the \ 2 discrepancy function ad- 
vocated in Gelman, Meng and Stern, 1996, or the 
Shapiro- Wilks test statistic illustrated below). 

Second, the partial posterior distribution function 
of the test statistic must be evaluated at its ob- 
served value. Because the partial posterior distri- 
bution is proportional to the ratio of the posterior 
distribution based on the full data to the sampling 
distribution of the test statistic determined in the 
previous step, performing numerical simulations to 
obtain the value of the partial posterior distribution 
function at the observed test statistic is also likely 
to be troublesome. Indeed, even for what are very 
simple hierarchical models, the authors felt obliged 
to provide appendices describing the MCMC algo- 
rithms used to perform these calculations. 

Partial posterior methods also do not seem well- 
suited for the construction of diagnostic plots. Graph- 
ical diagnostics — which are critical to model crit- 
icism and exploration — often involve the display of 
transformations of all data values, and thus are func- 
tions of a sufficient statistic. As noted above, this 
makes the use of partial posterior methods inappro- 
priate for the construction of such plots and may 



limit the utility of this approach in the exploratory 
stages of model refinement. 

A final point that should be considered in the ap- 
plication of partial posterior p values involves the 
trade-off between the cost of computing these diag- 
nostics versus the cost of fitting expanded models 
that have been targeted to detect a particular devi- 
ation from the null model. The example in Section 4 
illustrates this point well. In that example, a normal- 
normal hierarchical model with a fixed second-stage 
mean no is assumed. By conditioning on a test statis- 
tic that represents a component of the sufficient 
statistic that would be used to estimate fio (if its 
value was not known a priori), partial posterior model 
diagnostics overcome the masking effect that plagues 
the other methods considered in the article. How- 
ever, fitting an expanded model in which ft® was 
regarded as random would be several orders of mag- 
nitude easier to implement. It would also provide a 
much cleaner summary of the original model's in- 
adequacy. Although this stylized example was only 
proposed for purposes of illustration, I suspect that 
similar comments might also apply to more elabo- 
rate models. 

As it happens, many of the obstacles associated 
with implementing partial posterior model diagnos- 
tics can be overcome by instead defining model di- 
agnostics using pivotal quantities. Like partial pos- 
terior model diagnostics, Bayesian model diagnos- 
tics based on pivotal quantities also produce test 
statistics that have a known reference distribution. 
The primary drawback of diagnostics based on piv- 
otal quantities is that the joint distribution of piv- 
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otal quantities drawn from the same posterior dis- 
tribution must be evaluated using prior-predictive 
methodology. However, in many cases the reliance 
on prior-predictive assessment can be circumvented 
through the use of probabilistic bounds on distribu- 
tions of dependent order statistics. 

The advantages of diagnostics based on pivotal 
quantities stem from the fact that the distribution of 
a pivotal quantity, say S(x,9), is the same whether 
it is evaluated at the "true" (i.e., data-generating) 
value of the parameter or at a value of 6 drawn from 
the posterior distribution (Johnson, 2007). Further- 
more, many pivotal test statistics are insensitive to 
the choice of end-stage prior distributions in hierar- 
chical models, which makes their use for diagnostics 
in such settings straightforward. To illustrate these 
diagnostics and to demonstrate how they can be 
used to complement information contained in partial 
posterior p values, two of the examples considered in 
B&C are re-evaluated below using diagnostics based 
on pivotal quantities. 

The first example concerns the data and model 
taken from O'Hagan (2003). From the normal- normal 
hierarchical structure of this model, it follows that 
the components of the pivotal vectors 

are marginally distributed as independent, standard 
normal deviates when evaluated at parameter values 
drawn from the posterior distribution, provided only 
that proper prior distributions are assumed for the 
hyperparameters (/x,r). 

Two end-stage priors were assumed for the hyper- 
parameters (h,t) in B&G, one an improper prior 
and the second the informative prior proposed by 
O'Hagan. To replicate findings for the improper pri- 
ors, I assume a priori that 

/ u~L r (— a, a), vr(<7 2 ) cx -tI(1/o, a) 

u 

and 

7t(t 2 ) oc — Z(l/a,a), 

T 

independently for a sufficiently large value of a. Al- 
though the effect of the value of a (or, more gener- 
ally, the limiting process used to obtain an improper 
prior specification) on prior-predictive assessment of 
the joint distribution of pivotal quantities is a topic 
of active research, the marginal distribution of piv- 
otal quantities obtained for a fixed data vector is 
generally insensitive to this choice. 



Quantile-quantile plots of three values of E for 
posterior draws of (/i,r) under the proper and im- 
proper prior specifications appear in Figure 1. A vi- 
sual examination of these plots clearly suggests that 
the fifth group mean is problematic. In practice, the 
evidence provided by these plots — which are typical 
of plots obtained for arbitrary draws of (//, r) from 
the posterior — would be sufficient to trigger an ex- 
amination of the distribution of observations from 
the fifth group. 

The notion of formal Bayesian model assessment 
using p values is a bit oxymoronic, but in the event 
that a Bayesian p value is desired to more formally 
assess the adequacy of these models, samples of piv- 
otal vectors like those displayed in Figure 1 can also 
be used to construct a summary test statistic. For 
normal data, the Shapiro-Wilks statistic (1965) is 
an attractive choice for this purpose. 

Figure 2 displays histogram estimates of the pos- 
terior distribution on the p values obtained by ap- 
plying the Shapiro-Wilks test statistic to a sample 
of 50,000 pivotal vectors E obtained from the poste- 
rior distributions defined using both the proper and 
improper prior specifications. Note that under the 
assumed model assumptions, the marginal distribu- 
tion of each of the p values displayed in this figure 
are (exactly) uniform. 

In general, prior predictive methods are required 
to formally evaluate the joint distributions of piv- 
otal quantities like those displayed in the plots of 
Figure 2. However, prior predictive methods are rel- 
atively computationally expensive to implement. As 
B&C note, they also do not apply to models defined 
using improper prior distributions. To avoid such 
computations, bounds on order statistics from de- 
pendent variables (Caraux and Gascuel, 1992; Rych- 
lik, 1992) can instead be used to obtain a bound 
on the p value associated with the joint distribu- 
tion of a pivotal quantity. In this case, such bounds 
can be used to obtain a p value for the test of the 
null hypothesis that the p values obtained from the 
Shapiro- Wilk statistic were generated from the as- 
sumed model (Johnson, 2007). For the proper and 
(limiting) improper prior specifications, these bounds 
are p < 0.07 and p < 0.05, respectively. Note that 
both of these bounds, as well as the diagnostic plots 
provided in Figure 1, were obtained using only pos- 
terior samples from the assumed model: No addi- 
tional MCMC (or other numerical) simulations were 
required to obtain these results. 
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Fig. 2. p values obtained by applying the Shapiro-Wilks test statistic to second-stage model residuals, p values displayed in 
the left panel were obtained from a model based on a proper prior distribution; the right panel displays p values obtained from 
a model specified with an improper prior distribution. 



Turning now to the hospital mortality data, sup- 
pose that the Jeffreys prior assumed for (a,/3) by 
B&C is truncated to the interval (a, 1/a) x (a, 1/a) 
for a suitably small value of a. When evaluated at 
independent samples of {pi}, a and (5 drawn from 
the posterior, it follows that values of Q defined by 

(3) d = Beta(pi;a,(3), i = l,...,12, 

are marginally distributed as i.i.d. uniform deviates 
under the assumed model. Thus, model adequacy 
can be evaluated by either examining vectors of these 
uniform values in probability plots, or by transform- 
ing their values to a scale appropriate for the model 
at hand. To this end, Figure 3 displays three ran- 
domly selected quantile-quantile plots of posterior 
samples of {pi} against quantiles from the corre- 
sponding Beta(a, (3) distribution. Each of these plots 
suggests that the hospital mortality rates may not 



have been generated from a common beta distribu- 
tion. 

Bayarri and Castellanos' selection of the maxi- 
mum proportion as a test statistic to conduct par- 
tial posterior model checks can be mimicked here by 
selecting the largest uniform deviate from each pos- 
terior sample of quantities in (3) as a summary test 
statistic. It follows that for a single vector Q drawn 
from the posterior, the twelfth order statistic, C(l2)i 
has distribution function F(x) = x 12 . Figure 4 dis- 
plays a quantile-quantile plot of 250,000 £12 values 
drawn from the posterior against the corresponding 
expected order statistics. 

Bounds on the distribution of dependent order 
statistics can again be applied to values displayed in 
Figure 4 to obtain a bound on the p value for model 
fit. For this test statistic, a bound of p < 0.10 is ob- 
tained. As before, calculation of this bound requires 
only output available from the MCMC algorithm 
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Fig. 3. Quantile-quantile plots of hospitality mortality rates. 
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FlG. 4. Quantile-quantile plot of largest uniform statistic 
C(i2) obtained from 250,000 posterior samples. The line in- 
dicated in the plot has slope 1 and intercept 0. 



used to sample from the posterior distribution. No 
additional simulation experiments or numerical ap- 
proximations are needed. 

Returning to a discussion of partial posterior p 
values, methodologies proposed by B&C for assess- 
ing the adequacy of second levels of hierarchical mod- 
els offer important advantages over several compet- 
ing methods, but they also present several practical 
difficulties. 

These difficulties include the following: 

1. Numerical evaluation of partial posterior distri- 
butions is computationally and conceptually chal- 
lenging even in simple normal theory problems. 
Defining appropriate test statistics and estimat- 
ing partial posterior distributions in more com- 
plicated models may prove impracticable. 

2. Nonuniformity of p values in finite samples, cou- 
pled with the numerical approximation of partial 
posterior distribution of the observed test statis- 
tic, makes it difficult to assess the evidence con- 
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tained in small p values. As the authors point out, 
anticonservatism is probably better than conser- 
vatism when diagnosing model fit. But neither is 
good, and the relative error associated with small 
p values is potentially quite large. 

3. Propriety of partial posterior distributions may 
be difficult to establish when objective priors are 
employed, particularly when selected test statis- 
tics represent a component of a sufficient statis- 
tic. 

4. Partial posterior model checks do not naturally 
facilitate graphical diagnostics and other infor- 
mal model checks that are critical to the pro- 
cesses of model refinement and criticism. 

Partial posterior p values do, however, possess an 
important property not shared by many compet- 
ing methods: Partial posterior p values can sub- 
stantially diminish the effects of masking. Indeed, 
evidence provided in the article suggests that par- 
tial posterior p values are an order of magnitude 
less sensitive to masking than p values computed 
using other standard methods. Provided that the 
proposed methodology can be extended to realisti- 
cally complex Bayesian models, this property offers 
assurance that large deviations from model assump- 
tions will not be overlooked simply because, say, a 
variance parameter was overestimated. 
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